This static report is accompanied by a live R Shiny dashboard.
Launch interactive dashboard to explore the models described below!Healthcare expenditure represents a substantial share of economic activity in many high-income countries. In the United States, national health spending is projected to reach 20.3% of Gross Domestic Product (GDP) by 2033, up from 17.6% in 2023 (Keehan et al. 2025). Accurately modelling healthcare utilisation and expenditure is essential for forecasting budgetary pressures, planning service capacity, and designing policies that can meet the needs of an ageing population in an efficient and sustainable way.
The Agency for Healthcare Research and Quality (AHRQ) publishes the Medical Expenditure Panel Survey (MEPS), which provides detailed information on healthcare utilisation, associated expenditures, insurance coverage, and socio-demographic characteristics, all standardised to represent a full calendar year for each respondent (Agency for Healthcare Research and Quality 2023). These data enable one to quantify how healthcare use and costs vary across population groups and to identify factors associated with higher or lower spending, which is valuable for targeting interventions and evaluating potential policy reforms.
From a modelling perspective, healthcare expenditure data pose several
distributional challenges - they are non-negative, highly right-skewed (a few
patients incur massive costs), and a large proportion of patients have no
recorded healthcare usage. Understanding both whether individuals access
healthcare and, conditional on doing so, how much is spent is crucial for
characterising demand and the resulting financial burden. This report uses 2012
MEPS data to investigate two primary outcomes related to physician services: the
number of doctor consultations (dvisit), capturing
healthcare utilisation, and the annual expenditures on doctor visits
(dvexpend), capturing the associated financial cost.
The 2012 MEPS dataset contains 10,638 observations on US adults aged 18–65 years. Tables 2.1 & 2.2 show the categorical and quantitative variables for both the full sample and a subsample of individuals with positive expenditure.
The covariates include demographic characteristics
(age, gender,
ethnicity, region,
and education), socio-economic status
(income),
health indicators (BMI, self-reported
general and mental health,
hypertension, and hyperlipidemia).
The number of non-physician visits (ndvisit) was
retained as a proxy for the latent propensity to seek care.
| Variable | Category | ||
|---|---|---|---|
gender
|
Female | 0.53 | 0.62 |
| Male | 0.47 | 0.38 | |
ethnicity
|
White | 0.70 | 0.71 |
| Black | 0.21 | 0.20 | |
| Native American | 0.01 | 0.01 | |
| Others | 0.09 | 0.08 | |
education_cat_detailed
|
Less than High School | 0.22 | 0.18 |
| High School Graduate | 0.31 | 0.29 | |
| Some College | 0.24 | 0.25 | |
| College Graduate | 0.15 | 0.17 | |
| Post-Graduate | 0.09 | 0.11 | |
region
|
Northeast | 0.15 | 0.16 |
| Midwest | 0.19 | 0.21 | |
| South | 0.39 | 0.38 | |
| West | 0.26 | 0.24 | |
hypertension
|
No | 0.75 | 0.65 |
| Yes | 0.25 | 0.35 | |
hyperlipidemia
|
No | 0.77 | 0.68 |
| Yes | 0.23 | 0.32 |
| Variable | Description | ||
|---|---|---|---|
bmi
|
Body mass index | 28.03 (6.41) | 28.68 (6.79) |
age
|
Age (years) | 40.25 (13.66) | 43.32 (13.55) |
education
|
Education (no. of years) | 12.77 (2.90) | 13.13 (2.81) |
income
|
Income (USD) | 60,817 (51,451) | 65,688 (54,115) |
dvisit
|
Doctor visits | 2.13 (3.63) | 3.94 (4.14) |
ndvisit
|
Non-doctor visits | 0.94 (2.91) | 1.51 (3.62) |
dvexpend
|
Doctor expenditure (USD) | 481.07 (1,646.79) | 889.71 (2,156.92) |
ndvexpend
|
Non-doctor expenditure (USD) | 159.59 (790.31) | 266.51 (1,038.92) |
Doctor-visit expenditure (dvexpend) is semi-continuous,
characterised by a point mass at zero (45.9% of observations)
and a heavy right skew among positive values. To address this, a two-part
model was used:
For both parts, the choice of covariates was informed by a preliminary Lasso
regression using the full set of candidate predictors, combined with prior
domain knowledge. ndvisit was retained a priori in both
parts as a proxy for latent propensity to use healthcare services. The final
two-part model is shown in Equations (2.1) & (2.2).
\[\begin{align}
% Part 1: Probit
\text{Part 1: }& \nonumber\\
\mathbb{P}[\text{dvexpend}_i > 0] &= \Phi(\eta_{1i}) \nonumber \\
\eta_{1i} &= \gamma_0 + \gamma_1 \text{age}_i + \gamma_2 \mathrm{I}(\text{gender}_i) + \gamma_3 \text{bmi}_i \nonumber \\
&\quad + \sum_{k=2}^{4} \gamma_{4k} \mathrm{I}(\text{ethnicity}_i = k) + \sum_{k=2}^{4} \gamma_{5k} \mathrm{I}(\text{region}_i = k) \nonumber \\
&\quad + \sum_{k=2}^{5} \gamma_{6k} \mathrm{I}(\text{education\_cat\_detailed}_i = k) \nonumber \\
&\quad + \sum_{k=2}^{5} \gamma_{7k} \mathrm{I}(\text{general}_i = k) + \sum_{k=2}^{5} \gamma_{8k} \mathrm{I}(\text{mental}_i = k) \nonumber \\
&\quad + \gamma_9 \mathrm{I}(\text{hypertension}_i) + \gamma_{10} \mathrm{I}(\text{hyperlipidemia}_i) \nonumber \\
&\quad + \gamma_{11} \text{income}_i + \gamma_{12} \text{ndvisit}_i \tag{2.1} \\[10pt]
% Part 2: Gamma
\text{Part 2: }& \nonumber\\
\mathbb{E}[\text{dvexpend}_i &\mid \text{dvexpend}_i > 0] = \mu_i \nonumber \\
\ln(\mu_i) &= \beta_0 + \beta_1 \text{age}_i + \beta_2 \mathrm{I}(\text{gender}_i) \nonumber \\
&\quad + \sum_{k=2}^{4} \beta_{3k} \mathrm{I}(\text{region}_i = k) + \sum_{k=2}^{5} \beta_{4k} \mathrm{I}(\text{education\_cat\_detailed}_i = k) \nonumber \\
&\quad + \sum_{k=2}^{5} \beta_{5k} \mathrm{I}(\text{general}_i = k) + \sum_{k=2}^{5} \beta_{6k} \mathrm{I}(\text{mental}_i = k) \nonumber \\
&\quad + \beta_7 \mathrm{I}(\text{hypertension}_i) + \beta_8 \mathrm{I}(\text{hyperlipidemia}_i) \nonumber \\
&\quad + \beta_9 \text{income}_i + \beta_{10} \text{ndvisit}_i \tag{2.2}
\end{align}\]
Where \(\eta_{1i}\) is the linear predictor for the Probit model,
\(\gamma_j\) are the coefficients for the Probit, \(\beta_j\) are the
coefficients for the Gamma GLM with a log link, and \(\mathrm{I}(\cdot)\) is the
indicator function, equal to 1 if the condition is met and 0 otherwise.
For categorical variables
(ethnicity, region,
education_cat_detailed
general and mental
health), the summation \(\sum\) indicates that separate coefficients are estimated
for each level \(k\) relative to the reference category (\(k=1\)).
The count of doctor visits (dvisit) exhibited significant
overdispersion (Variance: 13.1 > Mean:
2.1), violating the assumptions of a standard
Poisson model. A Negative Binomial regression was fitted to
account for the excess variability and lasso screening and prior domain knowledge
were used to inform covariate choice (Equation (2.3)).
A Poisson regression model with the same covariates as the final specification was also fitted and a formal overdispersion test, comparing the Poisson residual deviance to a \(\chi^2\) distribution with the corresponding residual degrees of freedom, yielded a Pearson dispersion of 4.18 (p < 0.001). The Poisson model had a higher Akaike Information Criterion (AIC) than the Negative Binomial model (AIC\(_\text{Poisson}\) = 50386 versus AIC\(_\text{NegBin}\) = 37334), further supporting the justification for a Negative Binomial model.
\[\begin{align} % Count model: Negative Binomial \text{Count model: } \text{dvisit}_i &\sim \text{NegBin}(\mu_i, \kappa) \nonumber \\[4pt] \log(\mu_i) &= \beta_0 + \beta_1 \text{age}_i + \beta_2 \mathrm{I}(\text{gender}_i) + \beta_3 \text{bmi}_i \nonumber \\ &\quad + \sum_{k=2}^{5} \beta_{4k} \mathrm{I}(\text{general}_i = k) + \sum_{k=2}^{5} \beta_{5k} \mathrm{I}(\text{mental}_i = k) \nonumber \\ &\quad + \sum_{k=2}^{4} \beta_{6k} \mathrm{I}(\text{ethnicity}_i = k) + \sum_{k=2}^{4} \beta_{7k} \mathrm{I}(\text{region}_i = k) \nonumber \\ &\quad + \beta_8 \mathrm{I}(\text{hypertension}_i) + \beta_9 \mathrm{I}(\text{hyperlipidemia}_i) \nonumber \\ &\quad + \beta_{10} \text{income}_i + \beta_{11} \text{ndvisit}_i \nonumber \\ &\quad + \sum_{k=2}^{5} \beta_{12k} \mathrm{I}(\text{education\_cat\_detailed}_i = k) \tag{2.3} \end{align}\]
To investigate the dependence structure between the frequency of visits and the
intensity of expenditure beyond observed covariates, a copula approach was
employed following the framework of Marra and Radice (2025b). Using Sklar’s Theorem,
the joint cumulative distribution function, \(H(y_1,y_2)\), of utilisation
(\(Y_1\)) and expenditure (\(Y_2\)) can be modelled by coupling their marginal
distributions (\(F_1\) and \(F_2\)) via a copula function \(C\):
\[
H(y_1,y_2) = C\big(F_1(y_1), F_2(y_2); \theta\big)
\]
where \(\theta\) is the association parameter. A Gaussian copula was fitted to
the residuals of the marginal models for individuals with positive expenditure
using the GJRM package in R (Marra and Radice 2025a).
Kendall’s \(\tau\), a measure of rank correlation, was estimated. For a Gaussian
copula \(\tau\) is related to the latent correlation, \(\theta\), by
\[
\tau = \frac{2}{\pi}\arcsin{(\theta)}
\quad \rightarrow
\theta = \sin{\left( \frac{\pi\tau}{2} \right)}
\]
Both the Gamma and Negative Binomial models were fitted with a log link, so their coefficients are interpreted multiplicatively. For a predictor \(x_k\) with coefficient \(\beta_k\), a \(\Delta\)-unit increase in \(x_k\) multiplies the mean (conditional mean cost in the Gamma model and expected visit count in the Negative Binomial model) by \[ \exp(\Delta \beta_k). \] and the percent change is given by \[ 100\left(\exp(\Delta \beta_k)-1\right)\%. \]
The two-part model effectively captures the dual processes of access and cost. In the first stage (Probit), females, older adults, and those with chronic conditions (hypertension, hyperlipidemia) were significantly more likely to incur expenditure (Table 3.1). In the second stage (Gamma GLM), conditional on seeking care, significant disparities in expenditure emerged. A 10-year increase in age is associated with a 9% increase in conditional costs. Interestingly, males incur 17.7% lower costs than females (p < 0.001). This finding aligns with work by Bertakis et al. (2000) who suggests women have higher healthcare engagement and diagnostic usage during reproductive years and for preventive screenings.
Self-reported general and mental health showed a strong effect with individuals who report “Poor” health incurring 197% higher costs than those in “Excellent” health. These patterns are consistent with clinical expectations - individuals who are sicker and/or more health-engaged both enter the system more often and incur higher costs once they do so.
The two-part model achieves an Root Mean Square Error of $1599 and a Mean Absolute Error of $553. The predictive performance was robust at the aggregate level, with the total predicted cost within 1.8% of the actual total, although the model underestimates extreme outliers (Figure 3.1).
Figure 3.1: Distribution of prediction errors (actual cost − predicted cost) by expenditure category. The boxplots display the median and interquartile range of errors for patients grouped by their actual healthcare expenditure and the red dashed line represents a perfect prediction. The large positive errors for the highest cost categories show that the model fails to account for the full magnitude of catastrophic health expenditures.
As shown in Table 3.2, utilisation patterns mirrored
expenditure. Black and “Other” ethnic groups showed significantly lower visit
counts compared to White patients (Incident Rate Ratios < 1), consistent with
structural barriers to access and utilisation documented in the US healthcare
system (Waidmann and Rajan 2000; Macias-Konstantopoulos et al. 2023). Non-physician visits
(ndvisit) were positively associated with doctor visits
(11% increase per
visit), perhaps suggesting complementarity rather than substitution between provider
types.
| Term | Estimate | Std. Error | p-value | Estimate | Std. Error | p-value |
|---|---|---|---|---|---|---|
| (Intercept) | -0.780 | 0.084 | <0.001 | 5.634 | 0.129 | <0.001 |
| age | 0.008 | 0.001 | <0.001 | 0.009 | 0.002 | <0.001 |
| bmi | 0.006 | 0.002 | 0.009 | |||
| income | 0.000 | 0.000 | <0.001 | 0.000 | 0.000 | 0.002 |
| ndvisit | 0.113 | 0.008 | <0.001 | 0.070 | 0.007 | <0.001 |
| General Health (ref: Excellent) | ||||||
| Poor | 0.761 | 0.111 | <0.001 | 1.090 | 0.160 | <0.001 |
| Fair | 0.538 | 0.063 | <0.001 | 0.598 | 0.112 | <0.001 |
| Good | 0.301 | 0.046 | <0.001 | 0.306 | 0.089 | <0.001 |
| VGood | 0.159 | 0.041 | <0.001 | 0.023 | 0.083 | 0.779 |
| Mental Health (ref: Excellent) | ||||||
| Poor | 0.156 | 0.139 | 0.259 | 0.262 | 0.206 | 0.203 |
| Fair | 0.113 | 0.071 | 0.114 | 0.147 | 0.119 | 0.214 |
| Good | -0.053 | 0.044 | 0.223 | 0.057 | 0.081 | 0.482 |
| VGood | -0.034 | 0.039 | 0.385 | 0.087 | 0.075 | 0.243 |
| Education (ref: Less than High School) | ||||||
| High School Graduate | 0.082 | 0.037 | 0.027 | 0.142 | 0.078 | 0.069 |
| Some College | 0.241 | 0.040 | <0.001 | 0.251 | 0.081 | 0.002 |
| College Graduate | 0.358 | 0.047 | <0.001 | 0.290 | 0.092 | 0.002 |
| Post-Graduate | 0.397 | 0.058 | <0.001 | 0.409 | 0.107 | <0.001 |
| Ethnicity (ref: White) | ||||||
| Black | -0.054 | 0.034 | 0.113 | |||
| Native American | 0.179 | 0.149 | 0.232 | |||
| Others | -0.146 | 0.048 | 0.002 | |||
| Gender (ref: Female) | ||||||
| Male | -0.525 | 0.027 | <0.001 | -0.195 | 0.054 | <0.001 |
| Hyperlipidemia | ||||||
| Yes | 0.445 | 0.037 | <0.001 | 0.138 | 0.063 | 0.029 |
| Hypertension | ||||||
| Yes | 0.360 | 0.036 | <0.001 | 0.097 | 0.063 | 0.128 |
| Region (ref: Northeast) | ||||||
| Midwest | 0.034 | 0.045 | 0.455 | -0.017 | 0.084 | 0.838 |
| South | -0.098 | 0.040 | 0.013 | -0.192 | 0.077 | 0.012 |
| West | -0.178 | 0.043 | <0.001 | 0.006 | 0.082 | 0.945 |
| Term | Estimate | Std. Error | p-value |
|---|---|---|---|
| (Intercept) | -0.689 | 0.092 | <0.001 |
| age | 0.010 | 0.001 | <0.001 |
| bmi | 0.008 | 0.002 | <0.001 |
| income | 0.000 | 0.000 | <0.001 |
| ndvisit | 0.101 | 0.004 | <0.001 |
| General Health (ref: Excellent) | |||
| Poor | 1.178 | 0.095 | <0.001 |
| Fair | 0.812 | 0.064 | <0.001 |
| Good | 0.467 | 0.050 | <0.001 |
| VGood | 0.234 | 0.046 | <0.001 |
| Mental Health (ref: Excellent) | |||
| Poor | 0.588 | 0.119 | <0.001 |
| Fair | 0.368 | 0.068 | <0.001 |
| Good | 0.059 | 0.046 | 0.202 |
| VGood | 0.061 | 0.042 | 0.147 |
| Education (ref: Less than High School) | |||
| High School Graduate | 0.095 | 0.041 | 0.022 |
| Some College | 0.286 | 0.044 | <0.001 |
| College Graduate | 0.387 | 0.051 | <0.001 |
| Post-Graduate | 0.453 | 0.060 | <0.001 |
| Ethnicity (ref: White) | |||
| Black | -0.101 | 0.037 | 0.007 |
| Native American | -0.070 | 0.158 | 0.656 |
| Others | -0.126 | 0.053 | 0.018 |
| Gender (ref: Female) | |||
| Male | -0.629 | 0.029 | <0.001 |
| Hyperlipidemia | |||
| Yes | 0.369 | 0.037 | <0.001 |
| Hypertension | |||
| Yes | 0.318 | 0.037 | <0.001 |
| Region (ref: Northeast) | |||
| Midwest | 0.031 | 0.047 | 0.516 |
| South | -0.133 | 0.043 | 0.002 |
| West | -0.193 | 0.046 | <0.001 |
The copula analysis of positive spenders revealed a moderate-to-strong positive dependence between frequency and severity. This implies that patients who have more doctor’s appointments also tend to have higher-than-expected costs per visit, suggesting a compounding resource burden for patients with higher health care demands.
As shown in Figure 3.2, there is a clear positive association between the number of visits and total expenditure. While some mechanical correlation is expected (more visits naturally equal more cost), the estimated Gaussian copula parameter (Kendall’s \(\hat{\tau} =\) 0.522, \(\hat{\theta} =\) 0.731) indicates a dependence that extends beyond simple accumulation. This finding is in agreement with recent work by Marra and Radice (2025b) who used the same MEPS data, and also found that ‘heavy’ patient’s are distinct, not only in their visit frequency but also in their resource consumption intensity. This supports the use of a joint frequency-severity modelling framework over independent models.
Figure 3.2: Joint distribution of healthcare utilisation and expenditure for patients with at least one doctor’s visit. The vertical banding reflects the discrete nature of visit counts and the y-axis shows total doctor-visit expenditure on a logarithmic scale to accommodate skewness. The overlaying positive trend highlights that expenditure increases with visit frequency.
This analysis highlights the multifaceted drivers of healthcare demand and confirms that the decision to seek care and the resulting cost intensity are driven by distinct, but overlapping factors. Significant gender and health-status gaps reinforce the need for risk-adjustment models that explicitly account for biological and systemic usage differences (Bertakis et al. 2000).
A key strength of this study is the application of a joint frequency-severity framework. The estimated Gaussian copula parameter (\(\hat{\tau}\) = 0.522) aligns with recent work by Marra and Radice (2025b) on the same MEPS dataset. Marra and Radice (2025b), despite using a more advanced Rotated Copula with Gumbel and Dagum models, reported a similar dependence (\(\hat{\theta}\) = 2.67 corresponding to \(\hat{\tau}\) = 0.6251). This report’s replication of this finding using a parsimonious Gamma-Gaussian framework further validates the link between visit frequency and cost intensity as a fundamental structural feature of healthcare demand, rather than a model artefact.
Interpreting these coefficients as causal effects is constrained by observational biases. While the analysis identified ethnic disparities, chiming with the work of Macias-Konstantopoulos et al. (2023) on structural limits to healthcare access, the omission of insurance data acts as a confounding variable. Coefficients for income and ethnicity likely absorb variance related to coverage types, potentially biasing socio-economic conclusions. The positive association with non-doctor visits likely reflects simultaneity bias, where unobserved health shocks drive demand for both physician and non-physician care, rather than a clean substitution effect. In addition, selection bias is introduced by conditioning the Copula only on positive spenders, limiting generalisability to the extensive margin.
From a statistical standpoint, the GLM framework imposes strict linearity, missing non-linear health deterioration thresholds (such as rapid age-BMI interactions). Diagnostic plots also reveal that the Gamma distribution and Gaussian Copula underestimate the extreme right tail and enforce symmetric dependence. Future work may address these catastrophic costs, potentially by using tree-based machine learning to detect non-linearities, or by integrating actuarial Extreme Value Theory (EVT).
The transition from univariate averages to multivariate dependence structures is not just a statistical refinement, but a necessity for financial sustainability. Accurately modelling the compound risk of the high-cost patients through EVT and Copulas will enable health systems to better price the heavy tail of demand, ensuring the long-term viability of the health and insurance ecosystems that support the health of the many.
The Gamma model with log link handles the skewness of positive expenditures. The residual plots in Figure A.1 show minor deviations at the largest fitted values.
Figure A.1: Diagnostic plots for the Gamma GLM (Expenditure).
Figure A.2: Deviance residuals versus fitted values for the Gamma GLM of conditional doctor-visit expenditure. Each point represents a respondent and the dashed line indicates a residual value of zero.
The Negative Binomial model accounts for overdispersion in visit counts. The diagnostics indicate a reasonable fit for the count data structure.
Figure A.3: Diagnostic plots for the Negative Binomial (Utilisation).
Figure A.4: Deviance residuals versus fitted values for the Negative Binomial regression of doctor-visit counts.
Deviance residuals for utilisation show no strong systematic pattern, with the residuals centred around the dashed horizontal line at zero, confirming the model structure is appropriate. The plot displays distinct curve-like striations, which reflect the discrete integer nature of the visit count data. Most observations cluster near zero, though some large positive residuals occur at lower fitted values, indicating occasional under-prediction where patients had significantly more visits than the model anticipated. The variance remains relatively stable across the range of fitted values, confirming that the Negative Binomial distribution successfully handles the overdispersion.
To complement the regression analysis, address space constraints, and in a optimistic pursuit of potential extra credit, an interactive R Shiny web application was developed to allow one to explore the models described in this report further. This dashboard, or web application, allows one to dynamically input specific patient characteristics and observe the resulting predictions from the fitted models. It is accessible via:
Figure B.1: Annotated interface of the MEPS Healthcare Model Dashboard. The interface highlights (1) a button that runs the model for the selected parameters, (2) patient covariate configuration, (3) results navigation, and (4) the predicted expenditure and utilisation outcomes.
Figure B.1 shows an annotated view of the dashboard, including:
An accessible HTML version of this report is available via a public GitHub page:
This report was created in R Markdown. The source code is open source and available via:
The report R Markdown file was used to produce a tex file and the tool
texcount was used to count the words within each section.
The output of texcount is shown below. Excluding the appendix, Figures and
Tables the word count is
1554 (+3.6%).
texcount smm634-individual-assignment.tex
File: smm634-individual-assignment.tex
Encoding: utf8
Words in text: 2279
Words in headers: 59
Words outside text (captions, etc.): 307
Number of headers: 18
Number of floats/tables/figures: 11
Number of math inlines: 32
Number of math displayed: 6
Subcounts:
text+headers+captions (#headers/#floats/#inlines/#displayed)
4+13+0 (1/0/0/0) _top_
241+1+0 (1/0/0/0) Section: Introduction
78+1+38 (1/2/0/0) Section: Methods
204+2+0 (1/0/7/1) Subsection: Health-care expenditure
218+2+0 (1/0/13/3) Subsection: Health-care utilisation
55+1+0 (1/0/4/2) Section: Results
195+5+62 (1/1/0/0) Subsection: Model 1: Health care expenditure
62+5+57 (1/2/2/0) Subsection: Model 2: Health care utilisation
139+2+50 (1/1/2/0) Subsection: Dependence structure
362+3+8 (1/0/4/0) Section: Discussion and conclusion
0+1+0 (1/0/0/0) Section: Appendices
0+2+0 (1/0/0/0) Section: Model Diagnostics
101+4+35 (1/2/0/0) Subsection: Expenditure Model (Gamma GLM)
158+4+20 (1/2/0/0) Subsection: Utiliation model (Negative Binomial)
158+5+37 (1/1/0/0) Section: Bonus exploration: R Shiny dashboard
0+4+0 (1/0/0/0) Section: Reproducibility, accessibility, \& word count
30+2+0 (1/0/0/0) Subsection: Reproducibility \& accessibility
268+2+0 (1/0/0/0) Subsection: Word count
...Calculated using \(\tau = 1 - \frac{1}{\theta}\) for the Gumbel Kendall’s tau relationship.↩︎